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Abstract 

Current trends in parallel processors call for the 
design of efficient massively parallel algorithms 
for scientific computing. Parallel algorithms for 
Monte Carlo simulations of thermodynamic en- 
sembles of particles have received little attention 
because of the inherent serial nature of the statis- 
tical sampling. In this paper, we present a mas- 
sively parallel method that obeys detailed balance 
and implement it for a system of hard disks on 
the GPU. We reproduce results of serial high- 
precision Monte Carlo runs to verify the method. 
This is a good test case because the hard disk 
equation of state over the range where the liquid 
transforms into the solid is particularly sensitive 
to small deviations away from the balance condi- 
tions. On a GeForce GTX 680, our GPU imple- 
mentation executes 95 times faster than on a sin- 
gle Intel Xeon E5540 CPU core, enabling 17 times 
better performance per dollar and cutting energy 
usage by a factor of 10. 

1 Introduction 

During the last decades computational scien- 
tists have enjoyed a doubling of performance 
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for single-threaded applications every two years 
solely from improvements in computer architec- 
ture. This is no longer the case. Current processor 
designs run into the Power Wall, which limits at- 
tainable clock speeds in a given power budget, 
and the Instruction Level Parallelism (ILP) Wall, 
which exists because there is only so much ILP 
that can be extracted from a typical program. 1 
Moore's law still holds, for now, and the addi- 
tional transistors go into increasing the core counts 
on each new chip. Consequently, researchers must 
utilize parallelism to execute larger, longer, or 
more demanding calculations and simulations. 

As of this publication, a cluster of networked 
multi-core CPUs is the most common system ar- 
chitecture. In an alternative approach, a single 
graphics processing unit (GPU) can execute thou- 
sands of instructions at the same time and provides 
the performance of a small cluster at a fraction of 
the cost. 2 GPUs are becoming popular as desktop 
'personal supercomputers' and as coprocessors in 
heterogeneous clusters. A successful GPU algo- 
rithm divides a given computation into a maximal 
number of identical, fully independent, and sim- 
ple tasks, called "threads". To fully utilize their 
potential it is necessary to design not just paral- 
lel, but massively parallel algorithms that scale 
to thousands of threads. Over the last few years, 
many problems have successfully been adapted to 
GPUs. One example is the molecular dynamics 
(MD) method for simulating thermodynamic en- 
sembles of particles, which is well suited for mas- 
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sive parallelization. Numerous MD software pack- 
ages support GPUs, including HOOMD-blue, 3 ' 4 
LAMMPS, 5 AMBER, 6 ' 7 NAMD, 8 OpenMM, 9 

FEN ZI, 10 HALMD, 11 and the work by Rappa- 
1 ? 

port. 1Z 

A case where implementation to GPUs has so 
far not been achieved is Monte Carlo (MC) ap- 
plied to many-particle systems. MC is a statis- 
tical, rather than deterministic, sampling method 
that, appropriately implemented, samples the mi- 
crostates of desired thermodynamic ensembles. It 
is the method of choice in many situations because 
it only requires an interaction potential, not a force 
field, allowing e.g. the use of non-differentiable 
pair potentials. Such potentials are useful in the 
simulation of hard particles, which interact solely 
via excluded volume. MC is also flexible in the 
sense that a wide variety of update moves can 
be applied. 13 ~ 16 MC is easy to implement on se- 
rial machines where each step selects a new mi- 
crostate at random and accepts or rejects the new 
microstate based on the Boltzmann factor. An ex- 
ample trial move involves translating a single par- 
ticle in a random direction. Since the acceptance 
of a trial move depends on results of prior moves, 
subsequent moves usually cannot be performed in- 
dependently. A massively parallel algorithm must 
not only update a large number of particles at the 
same time, but also do so in a statistically correct 
way. This can be achieved by obeying detailed 
balance, which is not a trivial task. For these rea- 
sons, parallel MC codes for particle systems have 
received much less attention in the literature than 
parallel MD. 

Available parallel MC algorithms fall into sev- 
eral categories. Most of them rely on domain 
decomposition schemes to update portions of the 
problem in parallel. Lattice MC has been em- 
ployed for Ising models 17 ' 18 including GPU im- 
plementations. 19 ' 20 Related schemes for particle 
systems exist, 21 " 24 but none scale up to thousands 
of threads. Moreover, the updating of domains 
employed in some of those methods can introduce 
a sampling bias that precludes the balance condi- 
tions. Asynchronous parallel algorithms 25 ' are 
poorly suited for GPUs because of their extensive 
inter-thread communication. Hybrid approaches 
that employ MD trajectories to create trial con- 
figurations 27 ' 28 can be an effective way to exploit 
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Figure 1: In a serial hard disk simulation, each 
trial move typically consists of the following: 
(i) Select a single disk at random, (ii) apply a ran- 
dom displacement to it, and (hi) accept the move 
if it generates no overlaps. The cell list enables 
0(1) overlap checks by limiting the search space 
to only nine local cells. A sweep is defined as N 
consecutive trial moves, where N is the number of 
disks in the simulation. 

GPU parallelism, but require substantial additional 
programming and are not guaranteed to evolve any 
faster than MD alone. 

In this paper, we develop an algorithm for mas- 
sively parallel Monte Carlo (MPMC) simulations 
of many-particle systems that obeys detailed bal- 
ance. As a test case, we implement it on the GPU 
for a system of hard disks in two dimensions (Fig- 
ure 1) and validate it with comparisons to recent 
large-scale serial event-chain MC simulations. 29 
Our algorithm is not specific to this system, and it 
is valid for any MC simulation with local interac- 
tions between particles or lattice sites on any mas- 
sively parallel computer architecture. 

2 Algorithm 

In developing any parallel application, the pro- 
grammer must identify the computations that can 
be executed simultaneously. As the total work is 
broken into smaller tasks, opportunities for scal- 
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ing to more processors increase. GPUs execute 
especially fine grained work loads. In GPU MD 
applications, one thread typically acts on a single 
particle. 3 ' 5 ' 8 ' 11 ' 12 Such a decomposition is not di- 
rectly applicable to traditional MC because each 
trial move depends on the state of the neighboring 
particles. 

In MPMC, we utilize the cell list data struc- 
ture for parallel decomposition, as well as overlap 
checks in the case of hard particles. A checker- 
board decomposition permits many cells to be up- 
dated independently. 17 Although similar to ap- 
plying trial moves to particles in a particular se- 
quential order, checkerboard decomposition dif- 
fers from the serial algorithm in one key way. Par- 
ticle positions, not labels (or indices), determine 
the order of updates, so the order will change as 
particles migrate. Consequently, careless choices 
can lead to erroneous simulations. We prove that 
our implementation of MPMC obeys detailed bal- 
ance to ensure that no incorrect choices are made 
in its design. 



Algorithm 1 Monte Carlo sweep 

I: C 4— {a,b,c,d, . . .} 

2: rng.shuffle(C) 

3: for Q EC do > Loop over sub-sweeps 

4: for c E cells (Q), in parallel do > Loop 

over cells 
5: rng.shuffle(c. particles) 

6: for s E [0. . .um) do 

7: p <- 

c .particles[mod(s ,len(c .particles))] 

8: Generate trial move 

9: if p remains in cell and move ac- 

cepted then 
10: Move p 

11: end if 

12: end for 

13: end for 

14: end for 

15: d <(— rng.uniform(0,w/2) 

16: / rng. choose (—x, +x, — y, +y, . . .) 

17: shift_cells(/,d) 



2.1 Checkerboard decomposition 

The checkerboard domain decomposition scheme 17 
divides the simulation volume into sets of square 
(cubic) cells (see Figure 2). Checkerboarding 
maps well to MC simulations because it allows 
parallel updates of each set, comprising one quar- 
ter (one eighth in three dimensions) of the simu- 
lation volume. The x and y coordinates of the cell 
(and z in three dimensions) determine its checker- 
board set Q E {a,b,c,d, ...}: 



Q 



b 
c 
d 



if (x E Even) 
if (jc E Odd) 
if (x E Even) 
if (x E Odd) 



and 
and 
and 
and 



(y E Even) , 
(y E Even) , 

(ye Odd), 
(ye Odd), 



(l) 

where a, . ■ ■ indicate labels of checkerboard sets. 

The width of the cell w must be chosen greater 
than the diameter of the disk o (generally, the 
pair interaction cutoff). At the minimum w = o, 
two particles separated by one cell can move with- 
out interacting (see Figure 2(c)). Thus, the moves 
available to particles in a cell are independent from 
those in other cells of the same checkerboard set. 



Most previous parallel MC simulations with 
mobile particles use stripe domain decompo- 
sition, 21 " 23 a one-dimensional version of the 
checkerboard decomposition. The number of 
stripes and therefore the number of trial moves 
that can be conducted in parallel is low. This 
means stripe decomposition is not efficient for 
parallelization on more than a few cores. 

2.2 Sweep structure 

Algorithm 1 outlines the structure of MPMC. It 
splits each sweep over cells into sub-sweeps (four 
in two dimensions, eight in three dimensions), 
one handling each checkerboard set. Line 2 shuf- 
fles the order of checkerboard sets using Fisher- 
Yates 30 to guarantee a random permutation. Dur- 
ing a sub-sweep, the algorithm concurrently pro- 
cesses all of the cells in the active set (line 4). 

Each concurrent cell update shuffles and then 
loops over hm trial moves (lines 5,6). Line 7 se- 
lects the particle from the cell, repeating from the 
start of the list when % > n, where n is the number 
of particles in the cell. Fixing the number of moves 
in all cells to the same number distributes compu- 
tational effort most evenly among GPU cores. 31 
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Figure 2: (a) In massively parallel Monte Carlo (MPMC), trial moves are concurrently applied to particles 
in a subset of the cells. Moves that leave the cell are rejected, (b) Selected cells are separated by one row 
or one column of inactive cells. During the evaluation of the acceptance criterion, each active cell reads 
the particles in the eight neighboring inactive cells, (c) Simultaneous trial moves do not interact when the 
cell width is greater than the interaction range o. 



Line 8 generates a trial move for each particle. 
Line 9 accepts the move if it passes the normal 
Metropolis acceptance criterion 32 and the particle 
center remains in the cell. 21 Lines 15-17 main- 
tain ergodicity by performing a cell shift, which 
redraws the cell boundaries in a randomly chosen 
location. 

2.3 Detailed balance 

A MC simulation obeys detailed balance if, for ev- 
ery internal process evolving the system there ex- 
ists a reverse process occurring at the same rate. 
This ensures that a sequence of configurations con- 
verges to the correct equilibrium distribution, re- 
gardless of the initial condition. Markov-chain 
MC generates a sequence of configurations where 
the probability Xj(t + 1) of observing the system 
in state j at step t + 1 is determined only by the 
previous state i at step t. This can be expressed by 



Xj(t + l)=Xi(t)Pi 



(2) 



where x(t) = {x\(t) y X2(t), ...,x n (t)} is the proba- 
bility distribution at step t. The elements Pu of the 
transition matrix represent the probabilities that 



the system will transition from state i to state j. 
If there exists an equilibrium distribution of states 
x* for which x* = x*P, then x(t) is guaranteed to 
converge to x* as t — > °° when P satisfies detailed 
balance: 



x*P 



■ XjPji. 



(3) 



Although detailed balance of a Markov chain is 
a sufficient condition to ensure convergence, it is 
not necessary. Manousiouthakis and Deem show 
that an irreducible transition matrix that enforces 
regular sampling (3m : (P' n )ij > OVz, j) and obeys 
balance (x* = x*P) is both necessary and sufficient 
for convergence to the correct equilibrium distri- 
bution. 33 In this work we choose to enforce de- 
tailed balance for reasons discussed later in the pa- 
per. 

The MPMC algorithm constructs a Markov 
chain and obeys detailed balance on the level of 
a MC sweep. This follows directly from the obser- 
vation that for each sweep there is exactly one in- 
verse sweep, which can be seen as follows. Take a 
particular sequence of sub-sweeps and a sequence 
of hm moves (either accepted or rejected) within 
each cell. The reverse sweep consists of the re- 
verse sequence of sub- sweeps and the reverse se- 
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quence of moves within each cell, with each move 
following the negative of the original vector. For 
example, with hm = 6 trial moves per cell and n = 
4 particles in the cell, the original sequence would 
be [0,1,2,3,0,1]. There is exactly one particle 
shuffling, [1,0,3,2], that generates the reverse se- 
quence [1,0,3,2,1,0]. Since each sequence is cho- 
sen randomly from all possible permutations, the 
forward and reverse sequences occur with equal 
probability, and thus detailed balance holds. 

2.4 Pitfalls leading to incorrect statis- 
tical sampling 

Detailed balance is ensured when the following 
three steps are in place. 

1 . The particle center must not leave the cell. 21 
If particles are allowed to leave their cells 
during a sub-sweep, the reverse sequence of 
moves cannot be generated and detailed bal- 
ance is not ensured. When we skip this re- 
striction in the hard disk system, it always 
develops order in the same orientation and 
at a lower than expected density. 

2. Shuffling the particles in each cell. Parti- 
cles entering a cell are added at the end of 
the cell list and cell lists are partially main- 
tained during a cell shift. When we skip par- 
ticle shuffling, a temporal memory of previ- 
ous states builds up over many sweeps, vio- 
lating the Markov property. 

3. Shuffling the checkerboard set. Without 
shuffling of the checkerboard set, the reverse 
sweep cannot be generated, which violates 
the condition of detailed balance. 

To increase the number of accepted moves per 
sweep one might be tempted to allow particles to 
leave the cell, compensating the violation of step 
(1) by ensuring that each particle moves exactly 
once per sweep. However, this procedure does 
not guarantee balance. Cell updates with moves 
that leave a cell in the 'middle' of the cell update 
(i.e. not the first or last successful move of the sub- 
sweep into a given neighboring cell) are not re- 
versible and generate an incorrect probability dis- 
tribution. When we apply this scheme in the hard 



disk system, the pressure is shifted slightly away 
from the correct value and the magnitude of shift 
depends on the maximum trial move distance. 

The particle shuffling step (2) is often explicitly 
omitted in favor of sequential updating. 18 ' 20 > 22 ' 23 
As a justification, these authors refer to the analy- 
sis of Manousiouthakis and Deem, 33 who showed 
that shuffling is not necessary for the Ising lattice 
model away from infinite temperature. However, 
their analysis cannot be transferred to systems of 
mobile particles if the sequence of particles is de- 
termined dynamically. Our simulations for hard 
disks confirm (Table 1) that skipping the particle 
shuffling step alters the pressure close to the melt- 
ing transition. In contrast, the checkerboard set 
shuffling step (3) is not necessary for correct sam- 
pling. 

Table 1 : Simulations ofN = 256 2 hard disks with 
particle (P.) and checkerboard (CB.) shuffling en- 
abled or disabled. The comparison of equilibrium 
pressures P* for runs at three different packing 
fractions demonstrates the necessity to shuffle 
particles. Checkerboard shuffling is not required 
to obtain correct results. 



Shuffling P* in the hard disk system at 

P. CB. 0=0.698 0=0.708 = 0.716 

Yes Yes 9.17079(5) 9.18214(6) 9.1774(2) 

Yes No 9.1707(1) 9.1821(2) 9.1775(3) 

No Yes 9.1716(1) 9.1876(2) 9.1831(3) 

No No 9.1715(1) 9.1873(1) 9.1823(2) 



3 Implementation 

We implement MPMC for hard disks using the 
NVIDIA CUDA programming model and execute 
benchmarks on a GeForce GTX 680 graphics pro- 
cessor. CUDA is an established parallel program- 
ming language; details may be found in the CUDA 
programming guide, 34 text books 35 " 37 or in other 
publications, Refs. 2 ' 3 for example. The pseu- 
docode presented in this paper is general enough 
that it could be adapted to any data-parallel lan- 
guage (e.g. OpenCL or OpenMP). 
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3.1 Data Structures 

The proper choice of data structures can make or 
break an implementation's performance. We keep 
all of the architecture details of NVIDIA GPUs in 
mind when designing our implementation. MD 
codes store particles in a flat array with N ele- 
ments, and auxiliary data structures indirectly ref- 
erence this list by index. 3 ' 5 ' 11 ' 12 Such a data struc- 
ture is not appropriate for MC as it would be ex- 
pensive to rebuild the cell list every sweep. 

Instead, we store the particle positions directly 
in the cell list. That data is a sparse flat array, 
disk[jc,.y,i], with storage for m-m-n max particle 
positions, where m is the number of cells on the 
side of the simulation box and n max is the maxi- 
mum number of particles allowed in a cell. The 
auxiliary array, n[x,y] stores the number of parti- 
cles in each cell, where the particles are placed in 
elements i e [0 . . . n) . 

We minimize the number of overlap checks and 
maximize parallelism by setting the cell width w 
small, but not so small that a large fraction of 
moves will cross the cell boundaries. The size 
must also be chosen so that n max is known. Fig- 
ure 3 shows that the largest cell that can fit no 
more than four particles has a width w < y/l(7. 
Furthermore, we avoid expensive boundary condi- 
tion checks by choosing m as a multiple of 2 times 
the block size, because each thread handles every 
other cell. For the 32-thread blocks used here, we 
set m=[L/ {\/2a)\ and then round up to the near- 
est multiple of 64. 




Figure 3: Four disks of diameter o are placed on 
the corners of the cell and a fifth in the center. The 
smallest cell that can contain five disk centers has 
a diagonal of 2a and an edge length of W5 = y/2o. 
Thus, the largest cell that can contain a maximum 
of four disk centers has a width w < y/2a. 

Each cell has a local coordinate system to mit- 



igate floating point cancellation errors that would 
otherwise occur for large absolute coordinate val- 
ues. Whenever difference vectors are computed a 
coordinate system transformation is needed. The 
components of the translation vector are either 
+w, —w, or depending on the relative location 
of the neighboring cell. 

3.2 Kernel 

Algorithm 2 implements the MPMC sub-sweep 
(lines 4-13 of Algorithm 1) update in a CUDA 
kernel. One thread is launched for each cell in 
the active checkerboard set. Lines 1 and 2 com- 
pute the (x,y) index of the cell to which the thread 
is assigned. Rows in a thread block handle rows 
in the cell data. For example, threads with x ids 
0,1,2,3 are assigned to cells with x coordinates 
0,2,4,6 relative to some offset. Similarly, each row 
in a thread block is assigned to alternating y rows 
in the cell data. 

Each thread initializes its own random number 
stream (line 3). We use the Saru PRNG, devel- 
oped by Steve Worley, 38 to create uncorrelated 
random number streams from a hash of the thread 
index, current step index and a user chosen seed. 
NVIDIA s CURAND library is an alternative, but 
requires reading and writing a large state in each 
thread, which slows performance by 30%. See 
Ref. 39 for more details on the tradeoffs of various 
parallel PRNG schemes. 

Lines 4-9 read the assigned cell into shared 
memory and shuffle the particles. Line 11 starts 
a loop over % trial moves. For each selected par- 
ticle i in sequence, line 12 generates the trial move 
and lines 13-23 check for any overlaps with parti- 
cles in the neighboring cells. Lines 24-28 update 
the particle to its new position if the move gen- 
erates no overlaps and remains in the cell. Lines 
39-32 wrap i back to the start of the cell when the 
end is reached. 

When checking overlaps, each thread reads the 
eight neighboring cells. Typical GPU kernels with 
this memory access pattern use shared memory as 
a managed cache to avoid multiple reads from the 
same cell. In our case, the data size per cell is 
large and would occupy a substantial fraction of 
the available shared memory, limiting parallelism. 
Instead, we read only the current cell into shared 



6 



Algorithm 2 Sub sweep GPU kernel 

Require: gdim is (m/bdim.x/2,m/bdim.y/2) 

Require: (of f .x, of f .y) is the offset to the lower- 
leftmost active cell in the sub sweep 

Require: seed is a random number seed chosen 
by the user and fixed for the duration of a run 

Require: sweep is the index of the current 
sweep 

Require: /3 sh [p] is stored in shared memory such 
that each thread indexes unique elements in 
memory for p £ [0 . . . n max ) 
x «— 2 (bidx.x ■ bdim.x + tidx.x) + of f .x 
y «— 2 (bidx.y ■ bdim.y + tidx.y) + of f .y 
rng ^— Saru(mx + y, step, seed) 
n <S= n[x,y] 
if n == then 

return 
end if 

D sh [i\ <= disk[x,y, /] V/ £ [0 . . . n max ) 
rng.shuffle(D sh [0...n]) 

for s £ [0 . . . jim) do 

^move <- D sh [i] + rng.inCircle( d) 
overlap False 

for (x ne i g h,y n eigh) £ neighborhood of cell 
(x,y) do 

vector pointing to current neighbor 
for je [0...n max ) do 

continue when (xneigh^neigh) = 
(x,y) Ai = I 

D <= disk[x ne igh,y n eigh,;] 



Algorithm 3 Index cell data 



if 



^move -{D + S 

overlap <*— True 
end if 
end for 
end for 

if -ioverlap then 

if £> m ove £ cell (x,y) then 

■Dsh [t] <~ -Dmove 

end if 
end if 

i <(— i + 1 
if i > n then 

end if 
end for 

D sh [i] disk[x,y, i] Vi £ [0 . . . n m 



< o then 



l: procedure cell_index(x, y, i) 

2: if xe Odd then 

3: q^(x + m)/2 

4: else 

5: q^x/2 

6: end if 

7: return (i • m +y) -m + q 
8: end procedure 

memory (line 8) and use hardware cached reads 
for the neighbor accesses (line 18). 

A 128-byte wide cache line in GTX 680 fits four 
full cells in a row. However, a row-major assign- 
ment of [x,y, i] to a linear index is not ideal. Only 
one particle would be read at a time from alternat- 
ing cells in a row, using 8 out of the 128 bytes in a 
delivered cache line. A carefully chosen mapping 
from [x,y,i] indices to linear memory addresses 
leads to full utilization. Algorithm 3 implements 
that mapping. First, i is the slowest index so that 
memory instructions in loops over i in the ker- 
nel read contiguous data. The next fastest index 
is y which is handled in the traditional manner. 
The x index is the fastest, but it is rearranged so 
that all the odd, and similarly even, x values are 
contiguous in the linear space. We achieve fur- 
ther performance improvements by using texture 
reads (texlDf etch) in place of all global mem- 
ory loads. On GTX 680, the texture cache provides 
more throughput than LI for read-only data. 

Thanks to these efforts, we achieve excellent 
utilization of the available memory bandwidth. 
Benchmarks with the NVIDIA visual profiler 
show a sustained bandwidth of ~ 200 GB/s out 
of the texture cache. Achieving high occupancy 
and limiting divergence are just as important. For 
example, selecting the 16k/48k (Ll/shared) mode 
increases occupancy and boosts performance by 
50% compared to the 48k/16k mode. The loop on 
line 16 goes from to n max to boost performance 
by reducing divergent branches compared to loop- 
ing over the number of particles currently in the 
cell. We set the values of the empty particle slots 
to ( — 10,-10) so that they do not result in false 
overlaps. Early exit conditions upon finding the 
first overlap (not shown) cause additional diver- 
gent branches, but removing these checks reduces 
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Algorithm 4 Cell shift GPU kernel 

Require: gdim is (m/bdim.x,m/bdim.y) 
Require: D s h[p] is stored in shared memory such 



that each thread indexes unique elements in 
memory for p G [0 . . . n max ) 

1: procedure SHIFT_CELLS(/,rf) 
2: x <— (bidx.x ■ bdim.x) 
3: y (bidx.y ■ bdim.y) 
4: "current <= n[x,y] 

5: D sh [i\ <- (-10, -10) Vi G [0...n max ) 

6: "new <~ 

7: for i^G [0...n curren t) do 

8: D -<= diskfoy, i] 

9: D^D-fd 

10: ifD.x>0AD.y>0AD.x<wAD.y< 
w then 

11: D sh [n new ] 

12: "new ^ "new "i" 1 

13: end if 

14: end for 

15: (*neigh, ^neigh) «- cell in direction of / 

16: s<— vector pointing to neighbor 

17: "neigh n [-''•neigh jjneigh] 
18: for i G [0 . . . "neigh) do 

19: Z) disk[x neigh ,;y neigh , i] 

20: D^D-f d 

21: ifD.x>0AD.y>0AD.x<wAD.y< 
w then 

22: > Particle stays in neighbor cell, 

do nothing 
23: else 

24: D^D + S 

25: D sh [n neVi ) ^ D 

26: "new ^ "new ~\~ 1 

27: end if 

28: end for 

29: D sh [i] di sk_dbl [jc, y, z] Vi G [0 . . . n max ) 
30: n new ^> n_dbl[jc,y] 

31: end procedure 



performance due to the increase in computations 
and memory accesses. 

Kernel performance varies with block size. 3 
Short benchmarks show that (32, 4) is the fastest, 
and we use it for all production runs. It outper- 
forms the slowest by 55%, demonstrating the im- 
portance of performing this test. 

Algorithm 4 implements the cell shift step on 
the GPU by equivalently translating the particles in 
the opposite direction. One thread per cell gathers 
all of the particles that belong in the new cell and 
builds the list in shared memory. It then writes out 
the cell to a separate memory area, disk_dbl and 
n_dbl, so that other running threads do not read 
updated data. After the kernel completes, the dou- 
ble buffered data structures are swapped. Since the 
shift direction is one of either +x, —x, +y, or —y, 
only two old cells contribute particles to the new 
cell: the cell with the same index, and one neigh- 
bor. Lines 7-14 loop over the current cell, read 
in each particle, shift the cell, and if that particle 
is still within the cell boundaries it is added to the 
new list. Lines 18-28 perform the same operations 
on the neighbor cell, with the addition of a coordi- 
nate system transformation. These rely on the fol- 
lowing logic: if the particle left its host cell, it must 
have entered this cell. In this manner, each particle 
is checked for inclusion by two threads and at two 
separate points in the code. 

The floating point operations by both of these 
threads must be identical. Consider if line 20 
were to perform the coordinate system translation 
D — f ■ d + s and then check for particles that en- 
ter the current cell. Floating point round-off errors 
may result in both this check and the correspond- 
ing check on line 10 to fail, losing the particle. In 
a test simulation configured with 1024 2 particles, 
several hundred were lost after 10 sweeps. When 
implemented as shown in Algorithm 4, no parti- 
cles are lost even after 10 9 sweeps. 

3.3 Parameter tuning 

The maximum move radius d and the number 
of trial moves performed in each cell update hm 
are free parameters. At fixed hm, we test d G 
[0.06, 0.08,... 0.20] and at fixed d, we test n M G 
[1 ... 8]. Each test measures the autocorrelation of 
the average orientational order parameter 40 over a 
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long run of 10 9 sweeps for N = 512 2 . We find that 
d = 0.16 and um = 4 minimizes the autocorrela- 
tion time T when measured in wall clock seconds. 

4 Results 

The hard disk system is a standard model sys- 
tem in statistical mechanics and the one which 
was originally used to pioneer the Monte Carlo 
computer simulation method. 32 Its phase behav- 
ior is completely determined by the equation of 
state, which is the relation between internal pres- 
sure P and packing fraction <j) = pn(o /2) 2 . Here, 
p = N/V is density and V the volume of the simu- 
lation box. At packing fractions between <j) = 0.7 
to 0.72 the system undergoes a first-order phase 
transition from the liquid phase to the hexatic 
phase, followed by a continuous transition to the 
solid phase. 29 In their paper, Bernard and Krauth 
showed with an event-chain simulation method 16 
that only very large simulations (N > 256 2 ) have 
minimal finite size effects, and long equilibration 
times are necessary. This demonstrates the need 
for high performance MC code and explains why 
previous simulation studies of the hard disk sys- 
tem 41 " 46 have been unable to provide conclusive 
evidence for a first-order phase transition or the 
existence of an intermediate hexatic phase. 

Phase transitions are extremely sensitive to the 
slightest programming error or inadvertent cor- 
relation of trial moves due to the appearance of 
(quasi-)long-range spatial correlations and long 
equilibration times. Structural fluctuations are 
more important in two dimensions than in three 
dimensions, and they are particularly large for 
the hard disk system. These properties, together 
with the availability of high-precision serial data 
to compare with, make hard disks a good system 
to test our algorithm. 

In principal, computing the pressure for hard 
disks is simple: Estimate the radial distribution 
function g(r) in the limit as r approaches the disk 
diameter o from the right, and calculate 32 

P* = — = a 2 p(\ + -a 2 p Bin g(r) , (4) 
k^l \ 2 r^a+ J 

where we introduce the dimensionless pressure 
P* . In practice, obtaining an unbiased estimate re- 



quires special care. We estimate pressure using the 
following procedure. For each sampled configura- 
tion, a histogram of particle pair distances is com- 
puted over all N particles. n[ri\ counts the number 
of particle pairs between r, and r, + 8r. The pair 
distribution function g(r) is evaluated by the equa- 
tion 

P ( R ) = = ,cs 

8{ l) NpdA NplnRidr K ) 

at the sampling points r = Ru where 

2r i+ i 3 -n 3 

R i=~ 9 2" 

The formula for Ri is derived assuming a linear 
dependence of n(r) with r, which we observe to 
be valid close to o. The minimum position Rq is 
greater than a, so direct evaluation of the limit is 
not possible. We fit g(R{) to a polynomial of de- 
gree d in the range re (a, <J+c] and extrapolate to 
r = a. Extensive testing with a model distribution 
(n(r) = 30e~ 30r ) tunes the parameters to ensure 
that there is no systematic bias. We choose param- 
eters in the middle of the flat region with system- 
atic errors less than 10~ 5 . They are 8r = 10 c, 
c = 0.02a, and d = 5. 

Table 2 shows the average P* data obtained by 
simulations with MPMC. Independent runs ofN = 
256 2 , N = 512 2 , and N = 1024 2 particles are per- 
formed at packing fractions between = 0.698 
and = 0.718, which comprises the transforma- 
tion from liquid to solid. Each run starts with a 
randomly generated configuration at low density 
and is quickly compressed to the target. The run 
then continues at constant density for 10 9 sweeps, 
which only takes 4 days for /V = 256 2 (15 days for 
N = 512 2 ) to complete on a Tesla M2070 GPU. 
The pressure is averaged every 200 sweeps in each 
run after an equilibration period of 3 • 10 8 sweeps. 

Our data confirms the equation of state reported 
in Ref. 29 All values overlap within error bars. We 
do not analyze positional order or orientation or- 
der in the dense phase emerging from the phase 
transition. Such an analysis would be necessary to 
distinguish a hexatic phase from a solid phase and 
is left for a separate work. 40 We refer to Ref. 41 
for an in-depth comparison of the phase diagram 
of hard disks obtained with various algorithms, in- 
cluding MPMC. 
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Table 2: This table shows data for the equation of state P*((j)) over the range where the liquid 
transforms into the solid. It includes runs by MPMC (this work) and by Bernard and Krauth (BK) 
with serial event chain simulation. 29 Error bars are shown at two standard errors of the mean, o = 
2(([P* — (P*)] 2 ) /A^ sam pies ) ! 7/2 , where the number in parentheses is the error in the last digit shown. Our 
data is averaged over 8 independent runs of 10 9 sweeps (64 runs for = 0.718, /V = 256 2 ). See Ref. 29 for 
a description of the BK data averaging scheme. Differences in the pressures are shown with propagated 
error bars (cTmpmc + ^bk) 1/2 • 



System 
size 



Method 



Dimensionless pressure P* in the 
= 0.698 (j> = 0.702 (j) = 0.706 



hard disk system 
= 0.710 



at packing 
= 0.714 



fraction 
(j) =0.718 



N = 


-256 2 


MPMC 


9.1709(1) 


9.1920(2) 


9.1854(1) 


9.1792(1) 


9.1758(1) 


9.187(1) 






BK 


9.1708(4) 


9.1924(4) 


9.1858(5) 


9.1790(4) 


9.1758(5) 


9.186(1) 






Difference 


0.0000(5) 


0.0004(5) 


0.0004(5) 


0.0002(4) 


0.0000(6) 


0.001(1) 


N = 


: 512 2 


MPMC 


9.1699(5) 


9.1900(2) 


9.1861(1) 


9.1828(1) 


9.1800(1) 


9.1930(4) 






BK 


9.1700(2) 


9.1899(6) 


9.1856(6) 


9.1821(5) 


9.1803(4) 


9.1937(2) 






Difference 


0.0001(5) 


0.0001(6) 


0.0004(6) 


0.0007(5) 


0.0003(4) 


0.0006(5) 


N = 


1024 2 


MPMC 


9.1694(1) 


9.187(1) 


9.1858(3) 


9.1843(5) 


9.1819(7) 


9.21(1) 






BK 


9.1693(1) 


9.1880(2) 


9.1855(2) 


9.1843(2) 


9.1822(2) 


9.1949(3) 






Difference 


0.0001(2) 


0.001(1) 


0.0002(4) 


0.0000(5) 


0.0003(7) 


0.01(1) 



5 Performance 

We test the performance of the MPMC hard disk 
code using CUDA on a single GTX 680 GPU 
(Kepler). For comparison, we also implement 
the MPMC algorithm on the CPU and run it on 
our cluster nodes. The CPU implementation uses 
OpenMP to parallelize across all cores on a node. 
Each thread processes a single horizontal strip of 
the simulation domain and the innermost loop fol- 
lows Algorithm 2. We make no attempt to par- 
allelize across multiple GPUs or multiple CPU 
nodes using MPI. All tests are performed using 
single precision floating point format to store par- 
ticle coordinates. Table 3 lists complete specifica- 
tions of the hardware and software configurations 
of our test machines. 

5.1 Scaling with number of particles 

We perform benchmarks at a fixed packing frac- 
tion = 0.698 and analyze the performance scal- 
ing with N. Figure 4(a) plots the results and Ta- 
ble 4 lists selected numerical values. 

The algorithm has a running time t £ 0(N), so 
the number of trial moves per unit time, rj , should 
be constant under ideal circumstances. In practice, 



there are overheads that cause deviation from con- 
stant efficiency. Figure 4(b) collapses the individ- 
ual rj plots to a relative efficiency metric for each 
separate configuration. When running on a sin- 
gle CPU core, efficiency is flat with only a slight 
downward drift for large N. This is likely because 
the largest systems no longer fit in on-chip cache. 
The 8-core benchmarks show the same behavior 
at large A^, although it starts a factor of 2 higher 
because the same data is now split over 2 chips' 
caches. The 8-core results also have a slight dip for 
< 10 5 due to the overhead of managing worker 
threads. On the GPU, the kernel launch overhead 
is significant enough that for N less than 10 5 , effi- 
ciency is less than 70%. 

Despite this inefficiency, the GPU still outper- 
forms the single CPU runs by a factor of 82 for 
systems as small as N = 253 2 . At peak efficiency, 
the GPU speedup over a single core is a factor 
of 95. We prefer thinking about speedups com- 
pared to a single core, but recognize that there are 
other ways of evaluating it. On a per socket ba- 
sis, a single GPU is 24x faster than a quad core 
E5540. On a per-node basis, it is still an order of 
magnitude (12x) faster. In terms of aggregate per- 
formance per price, an 8-core node with 8 exter- 
nally attached GPUs (the configuration we use) is 
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Table 3: Benchmark hardware and software configurations. 





GTX 680 


E5540 


Hardware 


custom built 


HP DL2xl70h 


Mainboard 


ASUS Sabertooth 




Chipset 


AMD 990r"X 


Intel 5520 


CPU 


AMD Athlon 11 X4 630 


/"\ T.I "\/" I j C C A f\ 

2x Intel Xeon E5540 


f ' | } [ T . 1 .... 1 , 

CrU clock 


z.o Lrriz 


Z.3iLrrlz 


RAM 


4GB DDR3 


24GB DDR3 


PvAM clock 


1333 MHz 


1333 MHz 


CPU 


Gerorce G1X 680 




Core clock 


1006 MHz 




Memory clock 


oUUo JVlHz 




HD AHA 

DKAJV1 


z (jd (jDDKj 




OS 


Gentoo 


RHEL6 


Architecture 


x86_64 


x86_64 


CPU compiler 


GCC 4.5.3 


GCC 4.7.0 


flags 


-03 -funroll-loops 


-03 -funroll-loops 


GPU compiler 


CUDA 4.2 




flags 


default 




GPU driver 


295.59 





10 4 



(a) 



□ □ n^tRfti □□cninxtniiBnngii 

o GTX 680 

o E5540 (8 cores) 

a E5540(l core) 

A A |A A aAaAa A AAAAAAAAaAaAAA^W^A 



10 5 10 6 
N 



10 




o 



10 5 10 6 
N 
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Figure 4: Benchmarks are performed with a varying number of disks, N, at a packing fraction = 0.698. 
Each benchmark runs 100 sweeps to warm up and then measures the time it takes to run another 100 
sweeps as well as the the number of attempted trial moves to compute r\ = N moves /t. 11 separate runs 
are performed and the median result is plotted here. In all cases, the error bars (one standard deviation) 
are smaller than the symbol size. Subfigure (a) plots the efficiency of the computation r\ vs. N on various 
hardware configurations. Subfigure (b) plots 77 normalized by the maximum obtained on each hardware 
configuration. Subfigure (c) plots the speedup obtained over a serial execution. The dashed line marks a 
speedup of 8. 
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Table 4: Numerical values for selected benchmarks from Figure 4. 







Z33 


/oU 




1 <Of>2 
1 jZU 


JU4U 


GTX 680 


n 


6.23 • 10 8 


6.95- 


10 8 


6.66- 10 8 


6.41 -10 8 




7}/max(r\) 


0.895 


1.00 




0.96 


0.920 




1 / ^/serial 


OZ.O 


O/l c 

y4.o 




on o 

yu.y 


/.Z 


E5540 




5.77- 10 7 


5.78- 


10 7 


5.42- 10 7 


5.26- 10 7 


{o cores ) 


77/max(^j 








U.oDU 


U.oZU 




V/V serial 


7.65 


7.88 




7.39 


7.16 


E5540 




7.54- 10 6 


7.33- 


10 6 


7.33 -10 6 


7.35- 10 6 


(1 core) 


T]/max(77) 


0.880 


0.855 




0.855 


0.858 



95x faster than just the host, but only costs 5.5x as 
much for a benefit of 17x more sweeps/unit time 
given a fixed budget. We do not have power mon- 
itoring equipment on our cluster, so we are unable 
to provide actual measurements of power savings. 
Instead, we obtain an estimate using the manufac- 
turer's TDP specifications, which report 80W for 
the E5540 and 195W for the GTX 680. Based on 
these numbers and the per-socket speedup, a CPU 
simulation would use 9.8 times more energy than 
if it were run for the same number of sweeps on 
the GPU. 

5.2 Limitations 

The maximum number of parallel threads depends 
on the cell size, interaction range, and particle 
shape. It is approximately N/S for hard disks at 
high density. Hard spheres decrease the maximum 
to N/16. Expanding the interaction range to a 
truncated and shifted Lennard- Jones potential with 
r cut = 2.5o~ drops the maximum to N/144. GPUs 
operate at peak efficiency only when running more 
than 10000 threads, establishing minimum practi- 
cal system sizes of 80 thousand, 160 thousand, and 
1.44 million particles for hard disks, hard spheres, 
and Lennard-Jones beads, respectively. Depend- 
ing on the desired application, these sizes may 
be unnecessary or prohibitively large. A moder- 
ately threaded CPU implementation does not suf- 
fer from this problem. 



6 Conclusions 

Efficient parallel algorithms are essential for the 
application of Monte Carlo particle simulations on 
current and future computer hardware. Building 
on prior works utilizing checkerboard domain de- 
composition, we detailed an algorithm for mas- 
sively parallel MC and implemented it on the GPU 
and the CPU. The GPU speedup that we obtain 
is comparable to what has been achieved in MD 
simulations. Our findings demonstrate that GPUs 
are well- suited for running MC simulations when 
the number of particles is high enough. While the 
small system size overhead is the only restriction 
of our method, it can be significant if one is inter- 
ested in the behavior of systems of that size. 

During the course of this work we learned that 
it is surprisingly difficult to implement MC in a 
parallel environment. Every slight violation of 
the balance conditions leads to incorrect sampling, 
and with parallel update moves it is not simple 
to determine which schemes do not obey balance. 
Sometimes the effect on our simulations was so 
small that it would be easy to miss in a typical 
complex practical application. It is important to 
test any new parallel algorithm thoroughly in a sit- 
uation where reliable results are available from se- 
rial simulations. Hard disks are such a system. We 
computed to high precision the hard disk equation 
of state over the density range where the liquid 
transforms into the solid. Our results agree per- 
fectly with serial event-chain Monte Carlo simu- 
lations and are compatible with the presence of 
a first-order liquid-hexatic phase transition 40 - a 
finding that is scientifically significant by itself. 
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